Remember that summarizing data is initially all about discovery, the heart of exploratory data analysis.
How we summarize depends on whether the data is discrete or continuous.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
What variables are continuous? What are their data types?
customer_data <- read_csv("customer_data.csv")
## Rows: 10531 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, married, college_degree, region, state, review_time, review...
## dbl (5): customer_id, birth_year, income, credit, star_rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
One common statistic for a continuous variable is a mean.
customer_data |> summarize(avg_income = mean(income))
## # A tibble: 1 × 1
##   avg_income
##        <dbl>
## 1    138623.
Note that summarize() is more general than count() and can accommodate all sorts of calculations, similar to mutate(). What is the main difference between summarize() and mutate()?
Compute the mean of income and credit.
customer_data |>
  summarize(
    avg_income = mean(income),
    avg_credit = mean(credit)
  )
## # A tibble: 1 × 2
##   avg_income avg_credit
##        <dbl>      <dbl>
## 1    138623.       667.
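One way to explore the question about summarize() and mutate() is to run both on a tiny tibble and compare the number of rows that come back. This is only a sketch: the `toy` data and its values are made up for illustration and are not part of customer_data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  id = 1:4,
  income = c(50000, 60000, 70000, 80000)
)

# summarize() collapses the data to one row per group (here, one row in total)
toy_summary <- toy |>
  summarize(avg_income = mean(income))

# mutate() keeps every row and repeats the result in a new column
toy_mutated <- toy |>
  mutate(avg_income = mean(income))

nrow(toy_summary)
nrow(toy_mutated)
```

Comparing the two row counts shows whether each verb changes the number of rows or only the number of columns.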
{ggplot2} provides a consistent grammar of graphics built with layers.
Let’s plot the distribution of income.
customer_data |> ggplot(aes(x = income)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
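The message above suggests choosing a bin width instead of relying on the default of 30 bins. A minimal sketch follows, assuming a made-up `toy` data frame and an arbitrary width of 25,000; a sensible width for customer_data depends on the spread of income.

```r
library(ggplot2)

# Hypothetical incomes, for illustration only
toy <- data.frame(
  income = c(42000, 55000, 61000, 78000, 90000, 120000, 150000)
)

# binwidth is expressed in the units of x, so each bar spans 25,000 of income
p <- ggplot(toy, aes(x = income)) +
  geom_histogram(binwidth = 25000)

p
```

Setting binwidth (or bins) explicitly also silences the `stat_bin()` message.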
Visualize the relationship between income and credit.
customer_data |> ggplot(aes(x = income, y = credit)) + geom_point()
Visualize the relationship between star_rating and income.
customer_data |> ggplot(aes(x = star_rating, y = income)) + geom_point()
## Warning: Removed 7372 rows containing missing values or values outside the scale range
## (`geom_point()`).
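The warning comes from missing values in star_rating. The same missing values affect numeric summaries, so it can help to count them and decide how to handle them before computing a mean. The sketch below uses a made-up `toy` tibble, not customer_data.

```r
library(dplyr)

# Hypothetical ratings with missing values, for illustration only
toy <- tibble(star_rating = c(5, NA, 4, NA, 3))

rating_summary <- toy |>
  summarize(
    n_missing = sum(is.na(star_rating)),           # how many ratings are NA
    avg_naive = mean(star_rating),                 # NA, because NAs propagate
    avg_rating = mean(star_rating, na.rm = TRUE)   # mean of the observed ratings
  )

rating_summary
```

Dropping the rows first with drop_na(star_rating) is an alternative to na.rm = TRUE when the missing rows are not needed for anything else.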
What do we do if there is overplotting? There’s a geom for that (geom_jitter()).
Adjust the size and alpha geom arguments.
Add a geom_smooth() layer.
Facet by region?
customer_data |>
  drop_na(star_rating) |>
  ggplot(aes(x = star_rating, y = income)) +
  geom_jitter(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ region) +
  labs(
    title = "Relationship Between Star Rating and Income by Region",
    x = "Star Rating",
    y = "Income"
  )
## `geom_smooth()` using formula = 'y ~ x'
Grouped summaries provide a powerful solution for computing continuous statistics by discrete categories.
customer_data |>
  group_by(gender) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  )
## # A tibble: 3 × 4
##   gender     n avg_income avg_credit
##   <chr>  <int>      <dbl>      <dbl>
## 1 Female  5219    130685.       668.
## 2 Male    4214    146861.       666.
## 3 Other   1098    144735.       665.
Note how group_by() is a lot like facet_wrap(): both split the data by each category of a discrete grouping variable, so whatever follows is applied to one category at a time.
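To see what grouping does, compare a grouped summary with the same summary computed by hand for a single category. This is a sketch on a made-up `toy` tibble; the values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  income = c(100, 200, 300, 500)
)

# One summary row per category of gender
grouped <- toy |>
  group_by(gender) |>
  summarize(avg_income = mean(income))

# The same calculation for a single category, done with filter()
female_only <- toy |>
  filter(gender == "Female") |>
  summarize(avg_income = mean(income))

grouped
female_only
```

The Female row of the grouped result matches the filtered result, which is why a grouped summary can replace one filter() call per category.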
count() is a wrapper around a grouped summary using n().
customer_data |>
  group_by(gender) |>
  summarize(
    n = n()
  )
## # A tibble: 3 × 2
##   gender     n
##   <chr>  <int>
## 1 Female  5219
## 2 Male    4214
## 3 Other   1098
customer_data |> count(gender)
## # A tibble: 3 × 2
##   gender     n
##   <chr>  <int>
## 1 Female  5219
## 2 Male    4214
## 3 Other   1098
We can group by more than one discrete variable.
customer_data |>
  group_by(gender, region) |>
  summarize(
    n = n(),
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |>
  arrange(desc(avg_income))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 5
## # Groups:   gender [3]
##    gender region        n avg_income avg_credit
##    <chr>  <chr>     <int>      <dbl>      <dbl>
##  1 Other  Midwest     124    154637.       663.
##  2 Male   Midwest     420    152467.       666.
##  3 Other  Northeast   337    150564.       665.
##  4 Male   Northeast  1285    150498.       665.
##  5 Male   West       2079    149453.       667.
##  6 Other  West        519    144420.       667.
##  7 Female Midwest     557    134083.       671.
##  8 Female West       2497    133819.       668.
##  9 Female Northeast  1602    133333.       669.
## 10 Other  South       118    119068.       660.
## 11 Male   South       430    117988.       669.
## 12 Female South       563    105888.       664.
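The message about grouped output refers to the `.groups` argument of summarize(). By default only the last grouping variable is dropped, so the result above is still grouped by gender. A minimal sketch of returning an ungrouped result instead, using a made-up `toy` tibble:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  region = c("West", "South", "West", "West"),
  income = c(100, 200, 300, 500)
)

by_both <- toy |>
  group_by(gender, region) |>
  summarize(
    avg_income = mean(income),
    .groups = "drop"   # return an ungrouped tibble and silence the message
  ) |>
  arrange(desc(avg_income))

is_grouped_df(by_both)
by_both
```

Dropping the groups matters most when more verbs follow, since a later mutate() or slice_max() on a grouped result would work within each gender.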
We can also use slice_*() functions along with group_by().
customer_data |> group_by(gender, region) |> slice_max(income, n = 3)
## # A tibble: 36 × 13
## # Groups:   gender, region [12]
##    customer_id birth_year gender income credit married college_degree region
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>
##  1        6119       1984 Female 315000   698. No      Yes            Midwest
##  2        1299       1992 Female 306000   610. No      Yes            Midwest
##  3        7139       1957 Female 302000   727. No      Yes            Midwest
##  4        1040       1993 Female 356000   672. Yes     Yes            Northeast
##  5        6503       1997 Female 348000   599. No      Yes            Northeast
##  6       11249       1992 Female 343000   620. No      Yes            Northeast
##  7        2075       1989 Female 374000   578. No      Yes            South
##  8        7128       1966 Female 301000   708. Yes     Yes            South
##  9        4756       1977 Female 293000   790. No      Yes            South
## 10        7366       1993 Female 376000   682. No      Yes            West
## # ℹ 26 more rows
## # ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
## #   review_title <chr>, review_text <chr>
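One detail worth knowing about slice_max() is that it keeps ties by default, so a group can return more than n rows. The sketch below shows this on a made-up `toy` tibble; `with_ties = FALSE` is the documented way to cap the result at exactly n rows per group.

```r
library(dplyr)

# Hypothetical toy data with a tie for the top Female income
toy <- tibble(
  gender = c("Female", "Female", "Female", "Male", "Male"),
  income = c(300, 300, 100, 250, 200)
)

# Default: ties are kept, so the Female group returns two rows
top_default <- toy |>
  group_by(gender) |>
  slice_max(income, n = 1)

# with_ties = FALSE returns exactly one row per group
top_no_ties <- toy |>
  group_by(gender) |>
  slice_max(income, n = 1, with_ties = FALSE)

nrow(top_default)
nrow(top_no_ties)
```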
We often want to see how a variable changes over time: a time series. However, dates and times can be tricky.
customer_data |> ggplot(aes(x = review_time, y = star_rating)) + geom_line()
## Warning: Removed 7372 rows containing missing values or values outside the scale range
## (`geom_line()`).
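There’s a package for that: {lubridate}, which is attached along with the core tidyverse. Its mdy() function parses month-day-year text into proper dates.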
rating_data <- customer_data |>
  drop_na(star_rating) |>
  select(review_time, star_rating) |>
  mutate(review_time = mdy(review_time))

rating_data
## # A tibble: 3,159 × 2
##    review_time star_rating
##    <date>            <dbl>
##  1 2015-06-11            4
##  2 2008-03-25            5
##  3 2013-06-07            2
##  4 2016-04-20            5
##  5 2015-10-18            5
##  6 2015-01-06            5
##  7 2017-04-22            5
##  8 2014-09-11            4
##  9 2017-09-19            4
## 10 2013-12-12            5
## # ℹ 3,149 more rows
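A short sketch of what mdy() does on its own may help. The strings below are hypothetical: the raw format of review_time in customer_data is not shown here, so the slash-separated month/day/year format is an assumption.

```r
library(lubridate)

# Hypothetical month/day/year strings, for illustration only
raw_dates <- c("06/11/2015", "03/25/2008")

# mdy() reads the components in month, day, year order and returns Date values
parsed <- mdy(raw_dates)

class(parsed)
parsed

# Once parsed, components can be extracted with helpers such as year() and month()
year(parsed)
month(parsed)
```

If the text were in a different order, the matching parser (for example ymd() or dmy()) would be the one to use.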
rating_data |> ggplot(aes(x = review_time, y = star_rating)) + geom_line()
Let’s summarize the data by a period of time and then plot the time series.
rating_data |>
  mutate(review_year = year(review_time)) |>
  group_by(review_year) |>
  summarize(avg_star_rating = mean(star_rating)) |>
  ggplot(aes(x = review_year, y = avg_star_rating)) +
  geom_line()
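Year is only one possible period. As a sketch of summarizing by a finer period, floor_date() from {lubridate} rounds each date down to the start of its month, which can then serve as the grouping variable. The `toy` tibble is made up for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical reviews, for illustration only
toy <- tibble(
  review_time = ymd(c("2015-06-11", "2015-06-25", "2015-07-02")),
  star_rating = c(4, 2, 5)
)

monthly <- toy |>
  # Round each date down to the first day of its month
  mutate(review_month = floor_date(review_time, unit = "month")) |>
  group_by(review_month) |>
  summarize(avg_star_rating = mean(star_rating))

monthly
```

Because review_month is still a date, it can be mapped to the x aesthetic and drawn with geom_line() just like review_year.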
Just like there are geoms for visualizing continuous or discrete data, there are geoms for visualizing the relationship between continuous and discrete data.
customer_data |> ggplot(aes(x = income, y = gender)) + geom_boxplot()
customer_data |> ggplot(aes(x = income, fill = gender)) + geom_density(alpha = 0.5)
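The box in a box plot is drawn from the quartiles of each group, so a grouped summary can report the same numbers as a table. This is a sketch on a made-up `toy` tibble; the column names q1 and q3 are chosen here for illustration.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  gender = rep(c("Female", "Male"), each = 5),
  income = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

box_stats <- toy |>
  group_by(gender) |>
  summarize(
    q1 = quantile(income, 0.25, names = FALSE),   # lower edge of the box
    median = median(income),                      # line inside the box
    q3 = quantile(income, 0.75, names = FALSE)    # upper edge of the box
  )

box_stats
```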
Visualize the relationship between income and credit.
Map gender to the color argument.
Adjust the size and alpha arguments.
Add geom_smooth().
Facet by region and gender?
What changes when the color aesthetic is set in geom_point()?
Apply theme_minimal().
Set the legend.position argument in theme().
customer_data |>
  mutate(income = income / 1000) |>
  ggplot(aes(x = income, y = credit)) +
  geom_point(size = 3, alpha = 0.5, aes(color = gender)) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_grid(gender ~ region) +
  labs(
    title = "Income and Credit by Region and Gender",
    x = "Income (in Thousands)",
    y = "Credit"
  ) +
  scale_color_manual(
    name = "Gender",
    values = c("violet", "purple", "turquoise")
  ) +
  theme_minimal() +
  theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
Summary
Next Time
Supplementary Material
Artwork by @allison_horst
In RStudio on Posit Cloud, create a new R script and do the following.
customer_data and store_transactions.